Introduction

Row

The Problem & Data Collection

The Problem

This report will analyze NBA team performance data to determine which factors best predict regular season wins and playoff berths.

The Data

This dataset has 90 rows and 18 predictor variables. We exclude the Year variable from the models, as it is not relevant to them.

The Data

VARIABLES TO PREDICT WITH

  • PTS: Points Per Game
  • FGM: Field Goals Made
  • FGA: Field Goals Attempted
  • FGPct: Field goal Percentage
  • 3PM: 3 Pointers Made
  • 3PA: 3 Pointers Attempted
  • 3Pct: 3 Point Percentage
  • FTM: Free Throws Made
  • FTA: Free Throws Attempted
  • FTPct: Free Throw Percentage
  • OR: Offensive Rebounds
  • DR: Defensive Rebounds
  • REB: Total Rebounds
  • AST: Assists
  • STL: Steals
  • BLK: Blocks
  • TO: Turnovers
  • PF: Personal Fouls

VARIABLES WE WANT TO PREDICT

  • W: Wins
  • Playoffs: Whether the team made the playoffs (1 = Yes, 0 = No)

Data

Column

Summary Statistics

       W           PTS             FGM             FGA            FGPct      
 Min.   :14   Min.   :103.7   Min.   :37.70   Min.   :83.80   Min.   :43.00  
 1st Qu.:34   1st Qu.:110.4   1st Qu.:40.50   1st Qu.:86.47   1st Qu.:46.12  
 Median :44   Median :113.2   Median :41.75   Median :88.45   Median :47.00  
 Mean   :41   Mean   :113.2   Mean   :41.59   Mean   :88.44   Mean   :47.03  
 3rd Qu.:49   3rd Qu.:115.8   3rd Qu.:42.88   3rd Qu.:90.05   3rd Qu.:48.08  
 Max.   :64   Max.   :123.3   Max.   :47.00   Max.   :94.40   Max.   :50.70  
      3PM             3PA             3Pct            FTM       
 Min.   :10.40   Min.   :28.80   Min.   :32.30   Min.   :14.50  
 1st Qu.:11.50   1st Qu.:32.25   1st Qu.:34.90   1st Qu.:16.32  
 Median :12.45   Median :34.20   Median :36.05   Median :17.50  
 Mean   :12.54   Mean   :34.83   Mean   :35.98   Mean   :17.45  
 3rd Qu.:13.28   3rd Qu.:36.98   3rd Qu.:36.98   3rd Qu.:18.50  
 Max.   :16.60   Max.   :43.20   Max.   :38.90   Max.   :21.00  
      FTA            FTPct             OR               DR       
 Min.   :18.40   Min.   :71.30   Min.   : 7.600   Min.   :30.10  
 1st Qu.:21.20   1st Qu.:76.03   1st Qu.: 9.525   1st Qu.:32.40  
 Median :22.35   Median :78.15   Median :10.350   Median :33.30  
 Mean   :22.37   Mean   :78.03   Mean   :10.438   Mean   :33.37  
 3rd Qu.:23.60   3rd Qu.:79.58   3rd Qu.:11.200   3rd Qu.:34.17  
 Max.   :26.60   Max.   :83.50   Max.   :14.100   Max.   :37.50  
      REB             AST             STL             BLK       
 Min.   :38.80   Min.   :21.90   Min.   :6.100   Min.   :3.000  
 1st Qu.:42.75   1st Qu.:24.00   1st Qu.:7.025   1st Qu.:4.425  
 Median :43.85   Median :25.35   Median :7.400   Median :4.700  
 Mean   :43.81   Mean   :25.55   Mean   :7.467   Mean   :4.842  
 3rd Qu.:45.08   3rd Qu.:27.00   3rd Qu.:7.800   3rd Qu.:5.200  
 Max.   :49.20   Max.   :30.80   Max.   :9.800   Max.   :6.600  
       TO              PF       
 Min.   :11.20   Min.   :15.60  
 1st Qu.:12.40   1st Qu.:18.60  
 Median :13.05   Median :19.65  
 Mean   :13.13   Mean   :19.45  
 3rd Qu.:13.80   3rd Qu.:20.40  
 Max.   :15.70   Max.   :22.10  

The dataset summarizes per-game statistics for NBA teams. The number of wins ranges from 14 to 64, while points per game range from 103.7 to 123.3. Field goals made (FGM) range from 37.7 to 47.0 and field goals attempted (FGA) from 83.8 to 94.4, with field goal percentages (FGPct) from 43% to 50.7%. Three-pointers made (3PM, 10.4 to 16.6) and attempted (3PA, 28.8 to 43.2) show wider relative variation, with three-point percentages (3Pct) from 32.3% to 38.9%. Free throws made (FTM) and attempted (FTA) average 17.45 and 22.37 per game, with free throw percentages (FTPct) between 71.3% and 83.5%. The remaining columns cover rebounding (OR, DR, REB), assists (AST), steals (STL), blocks (BLK), turnovers (TO), and personal fouls (PF).

Data Viz #1

Column

Scatterplot & Correlation Between FGPct and Wins

Row

Visualization Summary

This visualization is a scatterplot matrix displaying the relationship between field goal percentage (FGPct) and the number of wins (W).

  • Correlation: The correlation coefficient between FGPct and W is 0.584, indicating a moderate positive relationship.
  • Scatterplot: The bottom-left scatterplot shows individual data points of FGPct versus Wins, supporting the correlation by displaying a general upward trend.

Overall, the data suggests that teams with higher field goal percentages tend to have more wins.

Data Viz #2

Column

Field Goal, 3 Point, and Free Throw % Between Playoff and NonPlayoff Teams

Row

Visualization Summary

This visualization is a bar graph that displays the average field goal percentage, three-point percentage, and free throw percentage for playoff and non-playoff teams. The 1 on the x-axis represents teams that made the playoffs. Those teams have higher averages on all three percentages, which suggests that better-shooting teams are more likely to make the playoffs.

Data Viz #3

Column

Average Points Per Season From 2021-2023

Row

Visualization Summary

This visualization is a bar graph that displays the average PPG (points per game) for every year in the dataset. The graph shows that the average points per game scored by teams has increased since 2021. This suggests that scoring is more important now than it was in 2021, which means that teams may need to prioritize high-scoring players in order to maximize their wins.

Linear Regression Model

Row

Predict Wins

For this analysis we will use a Linear Regression Model to predict Wins based on the predictors listed below.

Adjusted R-Squared

85 %

RMSE

4.46

Row

Regression Output

Term          Estimate  Std. Error  t value  Pr(>|t|)
TO              -4.341       0.643   -6.747     0.000
STL              5.001       0.835    5.988     0.000
OR              17.574      11.303    1.555     0.124
DR              17.136      11.171    1.534     0.129
PF               0.709       0.476    1.490     0.141
PTS            -10.959       7.795   -1.406     0.164
REB            -13.715      11.169   -1.228     0.224
FGM             15.089      14.200    1.063     0.292
FTM             11.245      12.307    0.914     0.364
FGPct           10.047      11.429    0.879     0.382
(Intercept)   -542.916     637.618   -0.851     0.397
3PM              8.435      11.210    0.752     0.454
3Pct             2.153       3.364    0.640     0.524
3PA              1.770       3.558    0.497     0.620
FGA              1.234       6.059    0.204     0.839
AST              0.078       0.516    0.150     0.881
FTPct            0.388       2.703    0.144     0.886
BLK             -0.090       0.703   -0.128     0.898
FTA             -0.293       9.304   -0.031     0.975

Row

Analysis Summary

The model explains about 85% of the variability in wins (adjusted R-squared), and the only predictors that are significant at an alpha of 0.05 are TO (Turnovers) and STL (Steals). The model is reasonably accurate at predicting the number of wins: the RMSE is 4.46, which means its predictions are typically off by about 4 wins.

Stepwise Regression Model

Column

Stepwise Regression Model predicting number of wins

Analysis Summary

The forward stepwise regression model explains 86.73% of the variance in wins, and its RMSE is about 4.46. Measured against an 82-game NBA season, this means the model does a reasonably good job of predicting a team’s wins.

The predictor variables that are significant at a 0.05 alpha are the following:

  • DR (Defensive Rebounds)
  • TO (Turnovers)
  • FG% (Field Goal Percentage)
  • STL (Steals)
  • FGA (Field Goals Attempted)
  • OR (Offensive Rebounds)
  • 3PA (3 Pointers Attempted)
  • 3P% (3 Point Percentage)

Of these variables, the ones that seem to have the greatest impact on winning per 1-unit change are defensive rebounds, followed closely by turnovers. This makes sense, as both statistics mark a change in possession, which creates an opportunity for a team to score.

Bootstrap Forest Model

Column

Bootstrap Forest Model predicting whether a team will/will not make the playoffs.

Analysis Summary

The model shows a reasonable fit on the training data (Entropy RSquare of 0.3914, Generalized RSquare of 0.5558) but performs significantly worse on the validation data (Entropy RSquare of 0.1183, Generalized RSquare of 0.1994). This suggests potential overfitting. The higher misclassification rate on the validation set (0.3182) compared to the training set (0.1176) also indicates overfitting.

  • The model performs well on training data but not as well on validation data (indicating overfitting)
  • The confusion matrices reveal that the model struggles to distinguish between the classes, particularly in predicting the 0 class.

K Nearest Neighbors Model

Column

K Nearest Neighbors Model predicting whether a team will/will not make the playoffs.

Analysis Summary

  • For K=6, the training misclassification rate is 0.32353 (22 misclassifications), while the validation misclassification rate is 0.22727 (5 misclassifications). This suggests that the model performs reasonably well on both the training and validation datasets.

  • The confusion matrices suggest that while the model performs well in distinguishing between some classes, there is still a significant number of misclassifications, particularly for class 1.

Conclusion

Column

Playoff Prediction Accuracy

When looking at Models 1 & 2, which predict whether a team makes the playoffs:

  • M1 BootF has a higher sensitivity (77.78%), indicating it is better at correctly identifying positive instances (playoffs = 1). However, it has a higher error rate (31.82%) compared to M2 KNN.
  • M2 KNN shows a lower error rate (22.73%), suggesting it is more accurate overall, but it has a lower sensitivity (66.67%).

Wins Prediction Accuracy

When looking at Models 3, 4, & 5, which predict the number of wins:

The best model is Model 3 (M3 LinReg)

  • It has the highest RSquare
  • It has the lowest RASE
  • It has the lowest AAE

Analysis Wrap Up

After creating and running several models, the best model for predicting the number of wins is Model 3: Linear Regression (M3 LinReg). For predicting whether a team makes the playoffs, the best model comes down to preference. If you want the model that is best at identifying teams that do make the playoffs (highest sensitivity), the best choice is Model 1 (Bootstrap Forest). If you want the best overall accuracy, the best choice is Model 2 (K Nearest Neighbors).

Significant Variables

The variables that are significant at the 95% confidence level (alpha of 0.05) and should be used in further analysis to predict wins are the following:

  • DR (Defensive Rebounds)
  • TO (Turnovers)
  • FG% (Field Goal Percentage)
  • STL (Steals)
  • FGA (Field Goals Attempted)
  • OR (Offensive Rebounds)
  • 3PA (3 Pointers Attempted)
  • 3P% (3 Point Percentage)
---
title: "Predicting NBA Team Success"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```

```{r load_data}
df <- read_csv("INFO3200_ProjectDatabase.csv")
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=650}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
This report will analyze NBA team performance data to determine which factors best predict regular season wins and playoff berths.


#### The Data
This dataset has 90 rows and 18 predictor variables. We exclude the `Year` variable from the models, as it is not relevant to them.


#### Data Sources
2023-24 NBA team STAT leaders ESPN. Available at: https://www.espn.com/nba/stats/team/_/season/2024/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc

2022-23 NBA team STAT leaders ESPN. Available at:
https://www.espn.com/nba/stats/team/_/season/2023/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc

2021-22 NBA team STAT leaders ESPN. Available at:
https://www.espn.com/nba/stats/team/_/season/2022/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc

2023-24 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2024_standings.html

2022-23 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2023_standings.html

2021-22 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2022_standings.html

### The Data
VARIABLES TO PREDICT WITH

* *PTS*: Points Per Game
* *FGM*: Field Goals Made
* *FGA*: Field Goals Attempted 
* *FGPct*: Field goal Percentage
* *3PM*: 3 Pointers Made
* *3PA*: 3 Pointers Attempted
* *3Pct*: 3 Point Percentage
* *FTM*: Free Throws Made
* *FTA*: Free Throws Attempted 
* *FTPct*: Free Throw Percentage 
* *OR*:  Offensive Rebounds 
* *DR*:  Defensive Rebounds
* *REB*: Total Rebounds
* *AST*: Assists
* *STL*: Steals
* *BLK*: Blocks
* *TO*: Turnovers
* *PF*: Personal Fouls

VARIABLES WE WANT TO PREDICT

* *W*: Wins
* *Playoffs*: Whether the team made the playoffs (1 = Yes, 0 = No)


Data
=======================================================================

Column {data-width=400}
-----------------------------------------------------------------------
### Summary Statistics

```{r, cache=TRUE}
# cache=TRUE lets you re-knit without rerunning every chunk from scratch; it can be removed. If the output does not reflect new updates, choose Knit > Clear Knitr Cache.

# Optional: make column names syntactic (for example, 3PM becomes X3PM)
#colnames(df) <- make.names(colnames(df))

# Drop Year and Playoffs so only W and the numeric predictors remain
df <- select(df, -Year, -Playoffs)
summary(df)
```
The dataset summarizes per-game statistics for NBA teams. The number of wins ranges from 14 to 64, while points per game range from 103.7 to 123.3. Field goals made (FGM) range from 37.7 to 47.0 and field goals attempted (FGA) from 83.8 to 94.4, with field goal percentages (FGPct) from 43% to 50.7%. Three-pointers made (3PM, 10.4 to 16.6) and attempted (3PA, 28.8 to 43.2) show wider relative variation, with three-point percentages (3Pct) from 32.3% to 38.9%. Free throws made (FTM) and attempted (FTA) average 17.45 and 22.37 per game, with free throw percentages (FTPct) between 71.3% and 83.5%. The remaining columns cover rebounding (OR, DR, REB), assists (AST), steals (STL), blocks (BLK), turnovers (TO), and personal fouls (PF).
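
Some of these columns are arithmetic combinations of others, which matters once they all enter the same regression. The sketch below is only a consistency check on the column means reported by `summary()`. It assumes the usual scoring rules (two points per made field goal, one extra per made three-pointer, one per free throw) and that total rebounds are offensive plus defensive rebounds. The vector name `means` and the two `*_rebuilt` names are illustrative, not part of the original analysis.

```r
# Column means copied from the summary() output (rounded as printed there).
means <- c(PTS = 113.2, FGM = 41.59, `3PM` = 12.54, FTM = 17.45,
           OR = 10.438, DR = 33.37, REB = 43.81)

# Assumed scoring identity: PTS = 2 * FGM + 3PM + FTM
# (FGM already counts made threes, so each three adds one extra point).
pts_rebuilt <- 2 * means[["FGM"]] + means[["3PM"]] + means[["FTM"]]

# Assumed rebounding identity: REB = OR + DR
reb_rebuilt <- means[["OR"]] + means[["DR"]]

# Compare the reported means with the rebuilt ones.
round(c(PTS = means[["PTS"]], rebuilt = pts_rebuilt), 2)
round(c(REB = means[["REB"]], rebuilt = reb_rebuilt), 2)
```

If the rebuilt values match the reported means up to rounding, predictors such as PTS and REB carry almost no information beyond the columns they are built from, which is one plausible reason for large standard errors in a model that includes all of them.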



Data Viz #1
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------

### Scatterplot & Correlation Between FGPct and Wins

```{r, cache=TRUE} 

# Create scatterplot matrix
ggpairs(select(df, FGPct, W))
```

Row
-----------------------------------------------------------------------

### Visualization Summary

This visualization is a scatterplot matrix displaying the relationship between field goal percentage (FGPct) and the number of wins (W).

* Correlation: The correlation coefficient between FGPct and W is 0.584, indicating a moderate positive relationship.
* Scatterplot: The bottom-left scatterplot shows individual data points of FGPct versus Wins, supporting the correlation by displaying a general upward trend.

Overall, the data suggests that teams with higher field goal percentages tend to have more wins.
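
The coefficient that `ggpairs()` prints is a Pearson correlation, and it can also be computed directly. The following is a minimal sketch on simulated data, not the project dataset: the data frame `toy` and the numbers used to generate it are made up for illustration, so the printed value will not equal 0.584.

```r
set.seed(1)                                   # reproducible illustrative data
n     <- 90                                   # same row count as the dataset
fgpct <- rnorm(n, mean = 47, sd = 1.5)        # simulated field goal percentages
wins  <- 41 + 4 * (fgpct - 47) + rnorm(n, sd = 9)
toy   <- data.frame(FGPct = fgpct, W = wins)  # hypothetical stand-in for df

r  <- cor(toy$FGPct, toy$W)       # Pearson correlation coefficient
ct <- cor.test(toy$FGPct, toy$W)  # adds a t-test and a 95% confidence interval

round(r, 3)
round(ct$conf.int, 3)
ct$p.value
```

Reporting the confidence interval alongside the coefficient gives a sense of how precisely a correlation is estimated from 90 rows.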



Data Viz #2
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Field Goal, 3 Point, and Free Throw % Between Playoff and NonPlayoff Teams

![](MultiBar.png)

Row
-----------------------------------------------------------------------

### Visualization Summary

This visualization is a bar graph that displays the average field goal percentage, three-point percentage, and free throw percentage for playoff and non-playoff teams. The 1 on the x-axis represents teams that made the playoffs. Those teams have higher averages on all three percentages, which suggests that better-shooting teams are more likely to make the playoffs.
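
The group averages behind a chart like this can be computed with base R's `aggregate()`. The sketch below uses six made-up rows (the data frame `toy` is hypothetical and the values are not from the dataset) purely to show the shape of the calculation.

```r
# Made-up illustrative rows; Playoffs uses the same 1 = Yes, 0 = No coding.
toy <- data.frame(
  Playoffs = c(1, 1, 1, 0, 0, 0),
  FGPct    = c(48, 47, 49, 46, 45, 47),
  `3Pct`   = c(36, 37, 38, 35, 34, 36),
  FTPct    = c(78, 80, 79, 77, 76, 78),
  check.names = FALSE                 # keep the non-syntactic name 3Pct
)

# Mean of each shooting percentage within each Playoffs group.
means_by_group <- aggregate(
  toy[c("FGPct", "3Pct", "FTPct")],
  by  = list(Playoffs = toy$Playoffs),
  FUN = mean
)
means_by_group
```

The result has one row per group, which is the layout a grouped bar chart needs.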



Data Viz #3
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Average Points Per Season From 2021-2023

![](AvgPTSSeason.png)

Row
-----------------------------------------------------------------------

### Visualization Summary

This visualization is a bar graph that displays the average PPG (points per game) for every year in the dataset. The graph shows that the average points per game scored by teams has increased since 2021. This suggests that scoring is more important now than it was in 2021, which means that teams may need to prioritize high-scoring players in order to maximize their wins.



Linear Regression Model {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------

### Predict Wins
For this analysis we will use a Linear Regression Model to predict Wins based on the predictors listed below.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
W_lm <- lm(W ~ . ,data = df)
summary(W_lm)
```


### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(W_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE
```{r, cache=TRUE}
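# summary()$sigma is the residual standard error: sqrt(RSS / residual degrees of freedom). It is reported here as the RMSE.
Sig<-round(summary(W_lm)$sigma,2)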
valueBox(Sig, icon = "fa-thumbs-up")
```

Row
-----------------------------------------------------------------------
### Regression Output

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(W_lm))[,4])  
out <- coef(summary(W_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

Row
-----------------------------------------------------------------------

### Analysis Summary
The model explains about 85% of the variability in wins (adjusted R-squared), and the only predictors that are significant at an alpha of 0.05 are TO (Turnovers) and STL (Steals). The model is reasonably accurate at predicting the number of wins: the RMSE is 4.46, which means its predictions are typically off by about 4 wins.
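
The 4.46 shown in the RMSE box comes from `summary(W_lm)$sigma`, which is the residual standard error: it divides the residual sum of squares by the residual degrees of freedom rather than by the number of rows. The sketch below shows the difference on simulated data (the data frame `toy` and its coefficients are made up). The last two lines apply the same conversion to the reported 4.46 under the assumption that the model was fit on all 90 rows with 19 coefficients (18 predictors plus the intercept).

```r
set.seed(42)                                   # reproducible illustrative data
n   <- 90
toy <- data.frame(TO = rnorm(n, 13, 1), STL = rnorm(n, 7.5, 0.6))
toy$W <- 41 - 4 * (toy$TO - 13) + 5 * (toy$STL - 7.5) + rnorm(n, sd = 5)

fit <- lm(W ~ TO + STL, data = toy)

rss   <- sum(residuals(fit)^2)                 # residual sum of squares
sigma <- sqrt(rss / df.residual(fit))          # what summary(fit)$sigma reports
rmse  <- sqrt(rss / n)                         # in-sample root mean squared error

c(sigma = sigma, summary_sigma = summary(fit)$sigma, rmse = rmse)

# Same conversion for the reported value, assuming n = 90 and 19 coefficients.
rmse_reported <- 4.46 * sqrt((90 - 19) / 90)
round(rmse_reported, 2)
```

The two quantities answer slightly different questions: the residual standard error estimates the noise level of the model, while the in-sample RMSE describes the typical miss on the rows used for fitting, so it is always the smaller of the two.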



Stepwise Regression Model
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Stepwise Regression Model predicting number of wins

![](test.png)

### Analysis Summary
The forward stepwise regression model explains 86.73% of the variance in wins, and its RMSE is about 4.46. Measured against an 82-game NBA season, this means the model does a reasonably good job of predicting a team’s wins.


The predictor variables that are significant at a 0.05 alpha are the following:

* DR (Defensive Rebounds)
* TO (Turnovers)
* FG% (Field Goal Percentage)
* STL (Steals)
* FGA (Field Goals Attempted)
* OR (Offensive Rebounds)
* 3PA (3 Pointers Attempted)
* 3P% (3 Point Percentage)

Of these variables, the ones that seem to have the greatest impact on winning per 1-unit change are defensive rebounds, followed closely by turnovers. This makes sense, as both statistics mark a change in possession, which creates an opportunity for a team to score.
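
The stepwise results above are shown as an image, so the exact procedure and stopping rule are not visible here. As a rough R counterpart, base R's `step()` can run a forward search, with the caveat that it adds terms by AIC rather than by p-value and may therefore select a different set of variables. The sketch uses simulated data: the data frame `toy`, its three predictors, and the effect sizes are made up for illustration.

```r
set.seed(7)                                    # reproducible illustrative data
n   <- 90
toy <- data.frame(DR  = rnorm(n, 33, 1.5),
                  TO  = rnorm(n, 13, 1),
                  AST = rnorm(n, 25, 2))
# Wins depend on DR and TO only; AST is pure noise in this simulation.
toy$W <- 41 + 3 * (toy$DR - 33) - 4 * (toy$TO - 13) + rnorm(n, sd = 5)

null_fit <- lm(W ~ 1, data = toy)              # start from the intercept-only model

fwd_fit <- step(
  null_fit,
  scope     = list(lower = ~ 1, upper = ~ DR + TO + AST),
  direction = "forward",                       # only add terms, never drop them
  trace     = 0                                # suppress the step-by-step log
)

formula(fwd_fit)
round(coef(summary(fwd_fit)), 3)
```

Because the selection and the p-values come from the same data, the p-values of a stepwise model tend to look better than they would on new data, so they are best read as a screening result.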



Bootstrap Forest Model
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Bootstrap Forest Model predicting whether a team will/will not make the playoffs.

![](BootstrapForest.png)

### Analysis Summary

The model shows a reasonable fit on the training data (Entropy RSquare of 0.3914, Generalized RSquare of 0.5558) but performs significantly worse on the validation data (Entropy RSquare of 0.1183, Generalized RSquare of 0.1994). This suggests potential overfitting.
The higher misclassification rate on the validation set (0.3182) compared to the training set (0.1176) also indicates overfitting.

* The model performs well on training data but not as well on validation data (indicating overfitting)
* The confusion matrices reveal that the model struggles to distinguish between the classes, particularly in predicting the 0 class.





K Nearest Neighbors Model
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### K Nearest Neighbors Model predicting whether a team will/will not make the playoffs.

![](KNN.png)

### Analysis Summary

* For K=6, the training misclassification rate is 0.32353 (22 misclassifications), while the validation misclassification rate is 0.22727 (5 misclassifications). This suggests that the model performs reasonably well on both the training and validation datasets.

* The confusion matrices suggest that while the model performs well in distinguishing between some classes, there is still a significant number of misclassifications, particularly for class 1.
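
The model above is shown as an image, so its settings beyond K=6 are not visible here. A minimal sketch of the same idea in R uses `knn()` from the `class` package on simulated data. The data frame `toy`, the two predictors, the 68/22 split, and the rule that generates `Playoffs` are assumptions made for illustration. Predictors are standardized with the training means and standard deviations, because nearest-neighbor distances depend on scale.

```r
library(class)   # provides knn(); a recommended package in standard R installs

set.seed(3)                                    # reproducible illustrative data
n   <- 90
toy <- data.frame(FGPct = rnorm(n, 47, 1.5), TO = rnorm(n, 13, 1))
score <- (toy$FGPct - 47) / 1.5 - (toy$TO - 13) + rnorm(n, sd = 0.8)
toy$Playoffs <- factor(as.integer(score > 0))  # 1 = playoffs, 0 = no playoffs

train_idx <- sample(n, 68)                     # assumed 68/22 split

# Standardize using training statistics only, then apply them to validation.
x_train <- scale(toy[train_idx, c("FGPct", "TO")])
x_valid <- scale(toy[-train_idx, c("FGPct", "TO")],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

pred <- knn(train = x_train, test = x_valid,
            cl = toy$Playoffs[train_idx], k = 6)

actual   <- toy$Playoffs[-train_idx]
conf     <- table(actual = actual, predicted = pred)  # validation confusion matrix
misclass <- mean(pred != actual)                      # validation misclassification rate

conf
misclass
```

With an even K, ties between the two classes are broken at random, so an odd K is a common choice for two-class problems.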




Conclusion
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
![](ModelCompare.png)

### Playoff Prediction Accuracy
When looking at Models 1 & 2, which predict whether a team makes the playoffs:

* M1 BootF has a higher sensitivity (77.78%), indicating it is better at correctly identifying positive instances (playoffs = 1). However, it has a higher error rate (31.82%) compared to M2 KNN.
* M2 KNN shows a lower error rate (22.73%), suggesting it is more accurate overall, but it has a lower sensitivity (66.67%).
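
Both measures come straight from a confusion matrix. The sketch below uses reconstructed validation counts: they assume 22 validation rows of which 9 are playoff teams, and were chosen because they reproduce the percentages quoted above. They are not copied from the model output, and the helper `metrics()` is illustrative.

```r
# Rows are actual classes, columns are predicted classes (1 = playoffs).
# Counts are a reconstruction that matches the quoted percentages.
labels <- list(actual = c("0", "1"), predicted = c("0", "1"))
conf_bootf <- matrix(c(8, 5,
                       2, 7), nrow = 2, byrow = TRUE, dimnames = labels)
conf_knn   <- matrix(c(11, 2,
                        3, 6), nrow = 2, byrow = TRUE, dimnames = labels)

metrics <- function(conf) {
  tp <- conf["1", "1"]                         # playoff teams predicted as playoff teams
  fn <- conf["1", "0"]                         # playoff teams the model missed
  c(sensitivity = tp / (tp + fn),
    error_rate  = 1 - sum(diag(conf)) / sum(conf))
}

comparison <- round(100 * rbind(BootF = metrics(conf_bootf),
                                KNN   = metrics(conf_knn)), 2)
comparison
```

Under these counts the Bootstrap Forest finds one more playoff team than KNN (7 versus 6 of 9) but mislabels more non-playoff teams, which is the trade-off described in the bullets above.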


### Wins Prediction Accuracy
When looking at Models 3, 4, & 5, which predict the number of wins:

The best model is Model 3 (M3 LinReg)

* It has the highest RSquare
* It has the lowest RASE
* It has the lowest AAE

### Analysis Wrap Up

After creating and running several models, the best model for predicting the number of wins is Model 3: Linear Regression (M3 LinReg). For predicting whether a team makes the playoffs, the best model comes down to preference. If you want the model that is best at identifying teams that do make the playoffs (highest sensitivity), the best choice is Model 1 (Bootstrap Forest). If you want the best overall accuracy, the best choice is Model 2 (K Nearest Neighbors).

### Significant Variables

The variables that are significant at the 95% confidence level (alpha of 0.05) and should be used in further analysis to predict wins are the following:

* DR (Defensive Rebounds)
* TO (Turnovers)
* FG% (Field Goal Percentage)
* STL (Steals)
* FGA (Field Goals Attempted)
* OR (Offensive Rebounds)
* 3PA (3 Pointers Attempted)
* 3P% (3 Point Percentage)